Abstract: Stochastic neural networks such as Restricted Boltzmann
Machines (RBMs) have been successfully used in applications
ranging from speech recognition to image classification.
Inference and learning in these algorithms use a Markov Chain
Monte Carlo procedure called Gibbs sampling, where a logistic
function forms the kernel of this sampler. On the other side of the
spectrum, neuromorphic systems have shown great promise for
low-power and parallelized cognitive computing, but lack
well-suited applications and automation procedures. In this work, we
propose a systematic method for bridging the RBM algorithm
and digital neuromorphic systems, with a generative pattern
completion task as proof of concept. For this, we first propose
a method of producing the Gibbs sampler using bio-inspired
digital noisy integrate-and-fire neurons. Next, we describe the
process of mapping generative RBMs trained offline onto the
IBM TrueNorth neurosynaptic processor, a low-power digital
neuromorphic VLSI substrate. Mapping these algorithms onto
neuromorphic hardware presents unique challenges in network
connectivity and weight and bias quantization, which, in turn,
require architectural and design strategies for the physical
realization. Generative performance metrics are analyzed to validate
the neuromorphic requirements and to best select the neuron
parameters for the model. Lastly, we describe a design automation
procedure which achieves optimal resource usage, accounting for
the novel hardware adaptations. This work represents the first
implementation of generative RBM inference on a neuromorphic
VLSI substrate.

Index Terms: Generative model, neuromorphic VLSI, Restricted
Boltzmann Machine, spiking digital neuron, Gibbs sampling.


I. Introduction

Deep Learning algorithms such as Restricted Boltzmann
Machines (RBMs) and Deep Belief Networks (DBNs)
have been successfully used in a wide range of cognitive
computing applications such as image classification [1], speech
recognition [2], [3], and motion synthesis [4]. Additionally,
these algorithms have been explored as possible solutions for
Brain-Computer Interfaces (BCI) and electroencephalography
(EEG) data feature learning and classification [5], [6]. RBMs
are generative learning algorithms and are particularly useful
in extracting features from unlabeled data (i.e. unsupervised
learning) [7]. Structurally, an RBM is a stochastic neural
network composed of two layers of neuron-like units: a layer
of visible units v which are driven by the real-world data of
interest and a layer of hidden units h which form connections
to these visible units. There are no interconnections within
a layer and the weights of connections between layers are
symmetric. Fig. 1a exemplifies an RBM with 4 visible and 3
hidden units.

The RBM defines a joint probability over the input data and
hidden variables specified by the Boltzmann distribution [8]:

$$p(v, h) = \frac{e^{-E(v,h)}}{\sum_{v,h} e^{-E(v,h)}}, \tag{1}$$

where $E(v,h) = -v^T W h - b_v^T v - b_h^T h$. Here p denotes
the Boltzmann probability distribution and E is a function
(also known as the "energy function") of v and h, where v
denotes the binary state (0 or 1) of the visible units and h
represents the binary state of the hidden units. The weight
between visible and hidden units is represented by W, while
$b_v$ and $b_h$ represent the biases of v and h, respectively. The
denominator is the sum over all possible states of visible and
hidden units, also known as the partition function.
Fig. 1: RBM and DBN representations. (a) An RBM formed
by 4 visible and 3 hidden units. (b) Gibbs sampling procedure
in an RBM. (c) A DBN formed by stacking RBMs.
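The Boltzmann distribution of an RBM can be checked numerically on a network small enough to enumerate. The following is a minimal sketch, assuming the standard energy function E(v, h) = -v^T W h - b_v^T v - b_h^T h; the function names and the random weights are illustrative assumptions, not part of the original work.

```python
import itertools
import numpy as np

def energy(v, h, W, b_v, b_h):
    """Energy E(v, h) = -v^T W h - b_v^T v - b_h^T h for binary v and h."""
    return -(v @ W @ h) - b_v @ v - b_h @ h

def joint_distribution(W, b_v, b_h):
    """Enumerate all (v, h) states and normalise exp(-E) by the partition function."""
    n_v, n_h = W.shape
    states = [(np.array(v), np.array(h))
              for v in itertools.product([0, 1], repeat=n_v)
              for h in itertools.product([0, 1], repeat=n_h)]
    unnorm = np.array([np.exp(-energy(v, h, W, b_v, b_h)) for v, h in states])
    return states, unnorm / unnorm.sum()   # unnorm.sum() is the partition function

# Illustrative (hypothetical) parameters: 4 visible and 3 hidden units, as in Fig. 1a.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 3))
states, p = joint_distribution(W, np.zeros(4), np.zeros(3))
```

Brute-force enumeration scales as 2^(n_v + n_h), so it is only practical for toy networks; this is the reason sampling-based inference is used for networks of realistic size.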


THIS MANUSCRIPT HAS BEEN SUBMITTED TO IEEE TBIOCAS FOR REVISION IN OCTOBER 2015

Inference in an RBM can be performed using a Markov
Chain Monte Carlo (MCMC) procedure called Gibbs sampling,
where each unit in any given layer is sampled conditioned
on its total input from units in the other layer. Fig. 1b
illustrates k steps of MCMC performed in an RBM. The
Gibbs sampling rule in binary RBMs is defined by the logistic
function,

$$\sigma(x) = 1/(1 + e^{-x}), \tag{2}$$

with the probability of activation of unit i as defined by [8]

$$p(x_i = 1 \mid x_j) = \sigma\Big(\sum_j w_{ij} x_j + b_i\Big), \tag{3}$$

where $w_{ij}$ is the weight from unit j to unit i for all
$j \notin layer(i)$, and $b_i$ is the bias of unit i. DBNs are formed by
stacking layers of RBMs (Fig. 1c) and it has been shown
that inference in a DBN can be done in a successive
layer-by-layer manner on each RBM [9]. RBMs and DBNs can
be used with labeled data for classification tasks either as
feature extractors to an external classifier or as completely
self-contained discriminative machine learning frameworks [1].
However, most of the data in the real world is unlabeled and,
in such situations, RBMs and DBNs can be used to perform
generative inference tasks. Applications of inference in such
unsupervised frameworks include, for example, restoration
of incomplete or occluded images and prediction of motion
sequences.
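The Gibbs sampling rule of Eqs. (2) and (3) can be sketched in software as follows. This is an illustrative floating-point reference, assuming a weight matrix W of shape (visible, hidden); the function names are hypothetical and the code does not represent the neuromorphic implementation described later.

```python
import numpy as np

def sigmoid(x):
    """Logistic function of Eq. (2)."""
    return 1.0 / (1.0 + np.exp(-x))

def sample_layer(x_other, W, b, rng):
    """Sample one layer given the other: p(x_i = 1) = sigmoid(sum_j w_ij x_j + b_i)."""
    p = sigmoid(W @ x_other + b)
    return (rng.random(p.shape) < p).astype(int)

def gibbs_chain(v0, W, b_v, b_h, k, rng):
    """Run k >= 1 alternating hidden/visible updates starting from visible state v0."""
    v = v0.copy()
    for _ in range(k):
        h = sample_layer(v, W.T, b_h, rng)   # hidden units given visible units
        v = sample_layer(h, W, b_v, rng)     # visible units given hidden units
    return v, h
```

For pattern completion, the known visible units would typically be clamped to their observed values after each visible update; that step is omitted here to keep the sketch short.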

Currently, inference tasks using RBMs and DBNs are
overwhelmingly realized in software, which is typically run on
high performance CPUs (Central Processing Units) and GPUs
(Graphics Processing Units). For ultra low-power, real-time
realizations of these algorithms, such as in mobile devices,
the solution tends to be sending information to the cloud for
processing. However, this demands, in many cases, reliable
communication between client and server, along with large
amounts of transmitted data. In this context, the Neuromorphic
Computing paradigm is a more suitable solution in terms of
low-power client-side processing. Neuromorphic VLSI (Very
Large Scale Integration) systems [10]-[17], inspired by
biological neural architectures and functions, have been
realized with analog, digital, and mixed-signal circuit elements.
Such systems typically compute in a massively parallel fashion
and communicate asynchronously using spikes. The principal
benefit of this architecture, which stands in contrast to the
traditional von Neumann computing paradigm, is extremely
energy efficient computation in a highly concurrent fashion.
Algorithms which demand large matrix multiplications, such
as RBMs and DBNs, benefit greatly in terms of computation
(and, consequently, power) when implemented in spike-based
systems, mainly because multiplications by zero are avoided
(i.e. absence of spikes does not generate computation).
Therefore, arrays of spiking neurons realized on neuromorphic VLSI
are well suited for classification, generation and other inference
tasks in the context of real-world high dimensional data.

The goal of our work is to develop a modular architecture in
a systematic fashion to form a foundation for building neural
networks, such as RBMs and DBNs, on substrates of digital
spiking neurons. As a proof of concept of our design approach,
we implement a pre-trained (i.e. trained offline) generative
RBM for pattern completion on the TrueNorth digital
neuromorphic VLSI device using the MNIST handwritten digit
images dataset.

The remainder of this paper is organized as follows:
Section II describes the Markov chain analysis of
the digital neural sampler; Section III describes the TrueNorth
system and the challenges in implementing Deep Learning
algorithms, along with the necessary steps for mapping the
RBM algorithm onto digital spiking neuromorphic hardware;
Section IV discusses quality metrics and the impact on
generative performance when using the digital neural sampler
and sparse network connectivity; Section V shows the
developed 3-stage RBM architecture and the generative model on
TrueNorth; Section VI illustrates the spike processing flow in
the TrueNorth RBM; Section VII details the design automation
procedure for optimal hardware utilization; Section VIII
presents the results of the physically-implemented TrueNorth
generative RBM; and the last section discusses conclusions
and future work.


II. Markovian Analysis of the Digital Neural Logistic Sampler

The kernel of the MCMC procedure for inference in an
RBM is the Gibbs Sampler and involves sampling from a
logistic function (Eq. (3)) [18], [19]. More specifically, it
involves sampling from a Bernoulli distribution (defining the
state of the RBM unit, x in Eq. (3)) parameterized by a
logistic function (activation probability). Traditional methods
for realizing a logistic sampler in hardware demand a look-up
table or functional approximation for the sigmoid [20]-[22],
which is then compared to the output of a pseudo-random
number generator. On the other hand, in spiking neural
hardware, such as TrueNorth, the only computational
primitives are neurons. Since sigmoidal activation functions are not
inherently present in TrueNorth, we have to make use
of the deterministic and stochastic neurodynamical properties
of the system for efficient realization of the logistic sampler.
Below we describe the process of Gibbs sampling using digital
spiking neurons in a Markov chain framework, which is useful
for better understanding the sampler behavior and serves as
a means for producing the generative performance metrics
detailed in Section IV. The solution neatly combines producing
the logistic function and sampling the state of the RBM unit.

In [23] it was shown that a digital integrate-and-fire
neuron with a uniformly-sampled threshold combined with a
Bernoulli-sampled leak can approximate a logistic spiking
probability for the corresponding RBM unit. Here we expand
on this by providing a Markov chain analysis of the
discrete-time neural sampler. The neural sampling procedure is
initialized by setting the neural membrane potential (Vm) to a value
equivalent to the argument of the logistic function (which is a
function of the weights, bias and unit states, as shown in Eq.
(3)). Afterward, the system uses three neural variables (two
stochastic and one deterministic) to produce an approximate
sigmoidal spiking probability. These variables are explained
next.
1) Stochastic leak. The stochastic leak is an integer
value added to the membrane potential, and is sampled from
Bernoulli trials with p = 0.5. In other words, at every time
step ("tick"), either the membrane potential remains the same
or it is incremented by the leak value (L). This type of leak
is inspired by the TrueNorth system, whose neurons can
be configured with stochastic non-voltage-dependent leak. For
our setup, we chose to use a positive leak; however, TrueNorth
neurons can take on positive or negative leak values. The
TrueNorth system will be explained in detail in Section III.

2) Stochastic threshold. The stochastic threshold is an
integer sampled from a uniform distribution between Vth and
Vth + TR; the term TR stands for "threshold range". At every
tick, the membrane potential of the neuron is compared with
the stochastic threshold and, in case the potential hits (i.e.
is equal to or exceeds) the threshold, a spike event will be
generated.

3) Sampling time window. The deterministic component
of the sampler is the sampling time window (Ts), which is
the number of ticks during which the neuron is observed. The
operation of the sampler during Ts is the following:

a. If during Ts the neuron hits the threshold at least once,
a spike event after Ts is produced.

b. Even if the neuron hits the threshold more than once
during Ts, the sampler must still produce a single spike
at the output after Ts.

c. If no threshold events occur during Ts, then no spike
event is produced at the output of the sampler after Ts.

A. Sampling algorithm using digital neurons

The algorithm, using TrueNorth-based I&F neurons with
stochastic leak (L) and threshold (Vth_rand), for realizing the
sigmoidal sampling rule (Eq. (3)) to perform MCMC sampling
in RBMs is given below.

    Vm = Vinit
    spiked = 0
    repeat
        Vm = Vm + B(0.5) * L
        Vth_rand = U(Vth, Vth + TR)
        if (Vm >= Vth_rand): spiked = 1
    until Ts steps

The term B(p) represents a Bernoulli sample (0 or 1)
with probability p and U(a, b) is an integer sampled from a
uniform distribution between a and b (both inclusive). The
membrane potential (Vm) is initialized to Vinit and, during
the "repeat" cycle, if Vm crosses the threshold (equivalent to
Vm >= Vth_rand), the spiked variable will be set to 1, after
which it will remain in this state until the end of the Ts
time steps. Therefore, the state of spiked after Ts ticks will
produce a sample (given the initial membrane potential) from
an approximate sigmoidal spiking probability distribution. The
state of a sampled RBM unit using this algorithm is equal
to the state of spiked. How to generate the spiked variable
using TrueNorth neurons will be explained in Subsection III-B,
along with implementation details in Section V. Next, we
will analyze the effect of the stochastic neural variables using
discrete-time Markov chains.
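The sampling algorithm above can be simulated directly in software. The following is a minimal sketch that ignores membrane potential saturation; the function names and the parameter values used in the usage note are illustrative assumptions, not the values selected in this work.

```python
import numpy as np

def neural_sample(v_init, L, v_th, TR, Ts, rng):
    """One sampler run: returns 1 if the threshold is hit at least once in Ts ticks."""
    v_m = v_init
    spiked = 0
    for _ in range(Ts):
        v_m += L * rng.integers(0, 2)                   # B(0.5) * L
        v_th_rand = rng.integers(v_th, v_th + TR + 1)   # U(Vth, Vth + TR), inclusive
        if v_m >= v_th_rand:
            spiked = 1                                  # latched until the window ends
    return spiked

def spike_probability(v_init, L, v_th, TR, Ts, n_runs=2000, seed=0):
    """Monte Carlo estimate of the spiking probability for one initial potential."""
    rng = np.random.default_rng(seed)
    runs = [neural_sample(v_init, L, v_th, TR, Ts, rng) for _ in range(n_runs)]
    return float(np.mean(runs))
```

For example, with the hypothetical setting L = 5, Vth = 10, TR = 20 and Ts = 4, an initial potential of 30 always spikes (it already equals Vth + TR), while an initial potential of -100 never does; sweeping Vinit between these extremes traces out the approximate sigmoid.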

B. Adaptation of neural variables into Markov chains

Since we are dealing with a discrete-time digital system,
the stochastic neural variables can be modeled as coupling
between two discrete-time Markov chains (DTMC): a stochastic
leak DTMC and a stochastic threshold DTMC. The sampling
time window determines how many steps should be taken in
these chains. Each state in a chain is the instantaneous value
of the membrane potential. Due to a limited number of bits
for data representation in the digital system, saturation levels
should be taken into account. For illustrative purposes, in our
examples we consider only positive leak values, implying that
only the positive saturation level will come into effect, as any
data point (i.e. membrane potential) beyond it will be clipped
to the saturation value.

The sampler operates by first initializing the stochastic leak
DTMC at the state which represents the initial membrane
potential value, and then taking alternate steps between the
stochastic leak and the stochastic threshold DTMCs. Different
initialization values of the stochastic leak DTMC yield
different sigmoidal probabilities. Both DTMCs present the same
number of states, defined by the membrane potential range.
In terms of structure, the DTMCs will always present states
representing lower-valued membrane potentials to the left of
the chain, and consequently the rightmost state represents
membrane potential equal to Vsat. Next, we discuss the effect
of the three neural properties on the DTMCs.

1) Stochastic leak DTMC. Since the stochastic leak
chosen for our examples causes only non-negative change in
the membrane potential, the only possible transitions, at each
stochastic leak tick, from a state are: (1) to itself (in the event
of no leak occurrence) or (2) to the right (positive additive leak
occurrence). Figure 2 shows the general case of the DTMC
for the stochastic leak (L). The number next to each state
transition is the transition probability (set to 0.5 for all states).
Note how no value of membrane potential can surpass Vsat,
which makes the state representing this specific membrane
potential an absorbing state [24], [25]. Since it is the only
absorbing state in the chain, it is called the terminating state
in a terminating DTMC. Also note this state will be reached by
more than one other state (not considering the self-connection)
when L > 1.


Fig. 2: DTMC for stochastic leak in the neural sampler.

2) Stochastic threshold DTMC. The stochastic threshold
is sampled, at each stochastic threshold tick, from a uniform
distribution, which produces a linearly increasing transition
probability from states inside the range [Vth, Vth + TR]
to the spiking state. Figure 3 shows the general case of the
DTMC for the stochastic threshold. Note how values outside
the range previously described are guaranteed not to hit the
threshold when V < Vth (realized by the self-connections)
and guaranteed to hit the threshold when V > (Vth + TR)
(realized by the connections to Vsat). For simplification, in
the figure the symbol A = (TR + 1).



Fig. 3: DTMC for stochastic threshold in the neural sampler. 


To transform spikes into probabilities, we must produce a 
single spike event after Ts in case the neuron reached the 
threshold during Ts. This can be obtained by using the Vsat 
state as the terminating state also for the stochastic threshold. 
A two-fold effect is produced by this terminating state: (1) the 
two DTMCs become coupled by using a common terminating 
state; and (2) the sigmoidal firing probability can be extracted 
directly from the terminating state in the stochastic threshold 
DTMC after Ts, as will be shown below. 

3) Sampling time window. The deterministic component 
of the sampler, the sampling time window, defines the number 
of steps taken in the Markov chains and is a two-phase 
process. The first phase occurs in the leak DTMC, where 
a new membrane potential value is assigned to the neuron. 
The second phase is the evaluation of the newly assigned
membrane potential in relation to the noisy threshold. This 
entire process is considered one step in the coupled DTMCs. 
In case the system resides in the terminating state, Vsat, of the 
coupled DTMCs after Ts, an ultimate single spike event will 
be produced; if the system is in any other state, no spike event 
will be produced. This results in spike events sampled from 
the sigmoidal spiking probability, conditioning on the starting 
state (Vinit) of the procedure. 


C. Matrix representation of Markov chains 

A terminating Markov chain presents a single absorbing 
state, also known as the terminating state; all the other states 
are transient. The transition probability matrix - with rows 
representing origin states and columns representing destination 
states - of a terminating Markov chain can be defined in the 
following manner: 


$$P = \begin{bmatrix} T & T^0 \\ \mathbf{0} & 1 \end{bmatrix} \tag{4}$$

In matrix P, the m x m transient-states transition matrix is
represented by T, the row-vector 0 represents the terminating
state's non-transient transitions, and $(I - T)\mathbf{1} = T^0$. Therefore,
the entire transition matrix P can be characterized by simply
knowing T.


1) Stochastic leak DTMC. The stochastic leak is
characterized by the additive leak value (L). The leak DTMC can be
defined by the transition matrix Pl in Eq. (5). The colors of the
matrix components represent the same individual components
as in Eq. (4).

(5)


2) Stochastic threshold DTMC. The stochastic threshold 
is characterized by the base threshold value (Vth) and the 
threshold range (TR). The threshold DTMC can be defined 
by the transition matrix Pth in Eq. (6). The colors represent 
the same individual components of Pth as in Eq. (4). 


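$$
P_{th} =
\begin{array}{ccccccccc|l}
1      & \cdots & 0      & 0              & 0              & \cdots & 0            & \cdots & 0              & \\
\vdots & \ddots & \vdots & \vdots         & \vdots         &        & \vdots       &        & \vdots         & V < V_{th} \\
0      & \cdots & 1      & 0              & 0              & \cdots & 0            & \cdots & 0              & \\
0      & \cdots & 0      & 1-\tfrac{1}{A} & 0              & \cdots & 0            & \cdots & \tfrac{1}{A}   & V = V_{th} \\
0      & \cdots & 0      & 0              & 1-\tfrac{2}{A} & \cdots & 0            & \cdots & \tfrac{2}{A}   & V = V_{th}+1 \\
\vdots &        & \vdots & \vdots         &                & \ddots & \vdots       &        & \vdots         & \\
0      & \cdots & 0      & 0              & 0              & \cdots & \tfrac{1}{A} & \cdots & 1-\tfrac{1}{A} & V = V_{th}+TR-1 \\
0      & \cdots & 0      & 0              & 0              & \cdots & 0            & \cdots & 1              & \\
\vdots &        & \vdots & \vdots         & \vdots         &        & \vdots       &        & \vdots         & V \ge V_{th}+TR \\
0      & \cdots & 0      & 0              & 0              & \cdots & 0            & \cdots & 1              &
\end{array}
\tag{6}
$$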


3) Spiking probability: coupled DTMCs and sampling
time window. To obtain the sigmoidal spiking probability,
the two transition matrices must first be coupled to produce
$P_c = P_l P_{th}$. The spiking probability can now be obtained by
computing $P_{sample} = P_c^{T_s}$, which represents Ts steps taken
in the coupled DTMC. With this, the last column (terminating
state) of the final matrix will contain the spiking probability,
$P_{spike}$, of each initial membrane potential (rows in the matrix).
Therefore,

$$P_{spike}(s_i) = P_{sample}(s_i,\ 2V_{sat} + 1), \tag{7}$$

where $s_i$ is the origin state corresponding to the initial
membrane potential of the neuron prior to sampling.
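The coupled-DTMC computation can be sketched numerically as follows. This is a minimal illustration, assuming integer states spaced by 1, a positive leak, inclusive uniform threshold sampling, and Vsat >= Vth + TR; the function names and the state range passed in are hypothetical choices made for the example.

```python
import numpy as np

def leak_matrix(n, L):
    """P_l: stay with probability 0.5, move right by L (clipped at Vsat) with 0.5."""
    P = np.zeros((n, n))
    for i in range(n - 1):
        P[i, i] += 0.5
        P[i, min(i + L, n - 1)] += 0.5
    P[n - 1, n - 1] = 1.0                      # Vsat is the absorbing state
    return P

def threshold_matrix(states, v_th, TR):
    """P_th: from potential V, jump to Vsat with P(V >= U(Vth, Vth + TR)), else stay."""
    n = len(states)
    P = np.zeros((n, n))
    for i, v in enumerate(states):
        p_hit = float(np.clip((v - v_th + 1) / (TR + 1), 0.0, 1.0))
        P[i, i] += 1.0 - p_hit
        P[i, n - 1] += p_hit
    return P

def spike_curve(states, L, v_th, TR, Ts):
    """Last column of (P_l P_th)^Ts: spiking probability for every initial state."""
    P_c = leak_matrix(len(states), L) @ threshold_matrix(states, v_th, TR)
    return np.linalg.matrix_power(P_c, Ts)[:, -1]
```

Calling `spike_curve(np.arange(-60, 61), L=5, v_th=10, TR=20, Ts=4)` returns one probability per initial membrane potential, which can be compared against a stochastic simulation of the sampler or against a scaled logistic function.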

D. Example 

The example, shown in Fig. 4, illustrates the sampler 
obtained using the previous calculations and compared with 
actual stochastic neuron simulations. The x-axis represents the
membrane potential (Vm) of the neuron at the start of the sampler
operation. As can be seen, the neural sampler obtained via
the coupled DTMC computation (blue line) and the stochastic 
simulation (averaged over 10^ samples for each initial Vm) 
of the neuron (blue circles) are overlapping. Besides this, the 
results from the DTMC computation approximate the ideal 
sampler scaled by a factor of 10 (red line) with considerable 
precision. 

Since the logistic function in Eq. (2) naturally presents a
dynamic range between -6 and +6, and because the TrueNorth
system deals only with integer-valued membrane potentials,
the scaling factor is a means of increasing the resolution of
the neural sampler. In other words, by "stretching out" the
function, each integer increment in the initial value of the
membrane potential represents a smaller step in the function,
resulting in higher resolution. To realize this using the digital
neuron model described thus far, the appropriate values of
Ts, Vth, TR, and L must be chosen. In Subsection III-B,
when physically implementing the neural sampler algorithm in
the TrueNorth system, the use of the scaling factor is further
discussed, and a quantitative analysis of the TrueNorth neural
sampler versus the ideal sampler for parameter selection is
detailed in Section IV.



Fig. 4: Ideal sampler versus DTMC computation and neural 
simulation. 


The noise sources of the stochastic leak and threshold
are, respectively, Bernoulli and uniform. By applying these
noise sources in a single tick, it is not possible to obtain
the precise S-shape of Fig. 4; only straight lines could be
obtained. Therefore, an explanation for the "curved" part of
the sigmoid (around Vm equal to -35 and +35) is the non-linear
behavior produced by the temporal aspect of the sampler (Ts).
Throughout multiple ticks, the combination of these "linear"
noise sources results in a more non-linear curve by creating
shorter segments from the straight lines.
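The single-tick case can be written out explicitly as a worked check. Assuming the sampler of Subsection II-A with Ts = 1 (the helper function u is introduced here only for illustration), the leak leaves the potential at Vinit or moves it to Vinit + L with equal probability, and the inclusive uniform threshold is then hit with a probability that is a clipped linear ramp:

$$u(x) = \min\!\left(1,\ \max\!\left(0,\ \frac{x - V_{th} + 1}{TR + 1}\right)\right), \qquad P_{spike}(V_{init}) = \tfrac{1}{2}\,u(V_{init}) + \tfrac{1}{2}\,u(V_{init} + L).$$

The average of two clipped ramps is piecewise linear in Vinit, so a single tick can only produce straight-line segments; the curvature of the sigmoid has to come from compounding several ticks.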

E. Considerations 

The problem was analyzed for a positive additive stochastic
leak, yet the same would be possible with a negative leak. The
main detail is that the last column of Pth, originally considered
the terminating state, would be able to be transitioned out of
due to the stochastic leak in the next step of the Markov chain.
Also, the terminating state in Pl would be the first column,
which does not "line up" with Pth.

On the other hand, if a negative leak is applied, though
not sufficient for a chain in the rightmost state to reach a
state below the maximum value of the threshold (i.e. below
Vth + TR), then the correct spiking probability can be obtained.
In this manner, even if the leak causes a transition to the left
in Pl, the following iteration of Pth will force the system to
return to the rightmost state. Interestingly, the coupled activity
of the two DTMCs can preserve the original terminating state,
even if it is not the terminating state in Pl.

Discrete phase-type distributions (DPTDs) [26] are very
similar in nature to the developed neural sigmoid sampler. The
main difference is that DPTDs result from a system of one or
more inter-related and sequentially occurring geometric
distributions, while the neural sampler results from a combination
of geometric (leak as Bernoulli trials) and uniform (threshold)
distributions.

Lastly, the digital neural sampler is an elegant solution
for sampling from a logistic function by not only using
bio-inspired neural dynamics but also simultaneously realizing two
operations: computing the spiking probability and sampling
to obtain the new state of the unit. The DTMC presented
can be very useful when simulating the network dynamics:
the neuron's transition operator can be extracted by simply
accessing the spiking probability curve obtained from the
DTMC. This removes the need to simulate every
step of the neuron during the sampling time window (Ts),
which comes in handy during the analysis of the generative
performance of the sampler in Subsection IV-B.

III. Approaches for Deep Learning on TrueNorth

Neuromorphic substrates present unique challenges for
creating spiking versions of machine learning algorithms due
to data precision and network connectivity constraints. In
this work, a step-by-step methodology for porting RBMs
and DBNs onto the IBM TrueNorth system is detailed. The
MNIST dataset, consisting of 28x28 pixel grayscale images of
handwritten digits 0 through 9, was chosen for the generative
inference task. For our experiments, the images were binarized
to zero-one values for adaptation to the neuromorphic scenario.
The following subsections present the TrueNorth system and
outline the approaches and quantitative analysis of the
algorithm adaptations necessary for mapping the (offline-trained)
networks.

A. The TrueNorth digital neurosynaptic processor 

IBM's TrueNorth is a very low-power, brain-inspired
digital neurosynaptic processor [16], with 4096 cores, totaling
1 million programmable spiking neurons and 256 million
configurable synapses (Fig. 5a). The core is the basic building
block of the system, each composed of 256 axons (inputs) and
256 neurons (outputs) (Fig. 5b), connected via a 256 x 256
crossbar of configurable synapses (Fig. 5c). Each neuron can
target its generated spikes to any axon on the chip, limited
to one axon per neuron, and presents over 20 individually
programmable features, including threshold, leak, reset, and
stochastic properties. From the user's point of view, neurons
operate in 1 ms time steps, during which asynchronous spike
event transmission and processing occurs between and inside
the cores. Therefore, during each 1 ms interval, spikes are
delivered to and processed in their destination cores, after
which a global clock aligns the generation of the next set
of spikes.





Fig. 5: The TrueNorth neurosynaptic processor: (a) chip
layout, wafer, and chip package; (b) high-level view of the 256
axons (inputs) and 256 neurons (outputs); and (c) internal view
of the fully-configurable binary crossbar [16].


The digital integrate-and-fire (I&F) TrueNorth neurons
present stochastic and deterministic leak and threshold
properties. A simplified representation of the dynamical behavior of
the membrane potential $V_j(t)$ for neuron j at time t is defined
by the following set of (sequentially processed per neuron)
equations [27]:

$$V_j(t) = V_j(t-1) + \sum_i A_i(t)\, w_{ij}\, s_j^{G_i} \tag{8a}$$

$$V_j(t) = V_j(t) + (1 - c_j)\,\lambda_j + c_j\, F(\lambda_j)\, \mathrm{sgn}(\lambda_j) \tag{8b}$$

$$\text{if } V_j(t) \ge \alpha_j + \eta(M_j): \text{ spike and set } V_j(t) = R_j \tag{8c}$$

The first line (Eq. (8a)) represents the synaptic integration
of all active axons impinging on neuron j at time t. The term
$A_i(t)$ is the binary-valued input spike arriving from the i-th
axon at time t; $w_{ij}$ is the binary-valued synaptic connection
between axon i and neuron j; and $s_j^{G_i}$ is the synaptic weight
between axon i and neuron j. This last term is particularly
interesting as each neuron presents four 9-bit signed integer
configurable weights. Therefore, an axon can be configured to
be one of four types, and this defines which of the four possible
weight values (individually in each neuron it is connected to)
will be integrated if the axon is active.

The second line (Eq. (8b)) represents the leak integration,
where $\lambda_j$ is a 9-bit signed integer. Depending on the value of
$c_j$, the leak can be deterministic ($c_j = 0$) or stochastic ($c_j = 1$).
When $c_j = 0$, the value of $\lambda_j$ is integrated in the membrane
potential. On the other hand, when $c_j = 1$, the stochastic
function $F(\lambda_j) = |\lambda_j| \ge \rho$ defines if a leak of value $\mathrm{sgn}(\lambda_j)$
is integrated; the value of $\rho$ is a sampled uniformly distributed
8-bit integer. In this manner, a stochastic leak can only take
on values of +1 or -1. However, the value of L in the digital
neural sampler (refer to the algorithm in Subsection II-A) can
take on much larger values. How to implement stochastic leaks
greater than 1 on TrueNorth will be explained in Section III-B.

The last line (Eq. (8c)) compares the integrated membrane
potential with the threshold, which has a base value of $\alpha_j$ and
a uniformly sampled value of $\eta(M_j)$ ranging from 0 to $2^{M_j} - 1$.
Therefore, if $V_j(t)$ is equal to or surpasses the threshold, the
neuron spikes and its membrane potential is reset to $R_j$. Using
the TrueNorth system as a basis for digital neural processing,
the next section shows how an approximation to the Gibbs
Sampler can be obtained using these neural properties.

B. Gibbs sampling with TrueNorth neurons 

Neural sampling can be realized on the TrueNorth system by
means of the algorithm described in Subsection II-A. The first
step of the algorithm is to set the initial membrane potential
of the neuron (Vinit in the algorithm) to the equivalent value
of the argument of the logistic function. This is realized in
TrueNorth by appropriately activating the axons of neuron
j at time t = 1 to produce the desired membrane potential
($V_j(1)$ = Vinit in Eq. (8a)). The neuron is then free to run (no
axon activity) during a sampling time window, Ts, defined in
number of 1 ms time steps ("ticks"), during which a stochastic
additive leak is applied and the updated membrane potential
is evaluated at every tick. If the neuron's membrane potential
is greater than or equal to the stochastic threshold (i.e. the
neuron spikes) at least once during Ts, the binary state of the
equivalent RBM unit is set to 1 (i.e. the RBM unit spikes).

For adapting the algorithm in Subsection II-A to TrueNorth,
the stochastic threshold can be directly modeled by setting the
appropriate values of $\alpha_r$ and $M_r$ for TrueNorth neuron r. The
stochastic leak, on the other hand, cannot be directly mapped
for absolute leak values greater than 1. An alternative to this
is to use an additional neuron l to act as the stochastic leak
for neuron r. For this, the parameters of neuron l are set to
$c_l = 1$, $\lambda_l = +128$, $\alpha_l = 1$, $M_l = 0$, and $R_l = 0$. Therefore,
neuron l naturally spikes with probability $p = \lambda_l/255 \approx 0.5$,
because it leaks $\mathrm{sgn}(\lambda_l) = +1$ with this same probability and
the threshold is set to $\alpha_l = 1$. After spiking, it is reset to
$V_l = 0$ and will present the same behavior in the next tick. If
we then connect the output of neuron l to an input axon (of
type i) of neuron r and set the memory position $s_r^i$ equal to
the leak value L, we will obtain the desired spiking behavior.
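The behavior of the leak neuron l can be sketched with a software model of the simplified neuron equations of Eq. (8). This is an illustrative approximation, not a TrueNorth API: the function name, argument layout, and the inclusive comparison used for the stochastic leak are assumptions made for the example.

```python
import numpy as np

def truenorth_tick(V, A, w, s, G, lam, c, alpha, M, R, rng):
    """One tick of the simplified model of Eq. (8); returns (new potential, spike)."""
    # (8a) synaptic integration: A[i] input spike, w[i] binary connection,
    # s[G[i]] the neuron's signed weight for the type G[i] of axon i
    V = V + int(np.sum(A * w * s[G]))
    # (8b) leak: deterministic when c == 0, stochastic +/-1 when c == 1
    if c == 0:
        V += lam
    else:
        rho = rng.integers(0, 256)          # uniformly distributed 8-bit integer
        if abs(lam) >= rho:                 # assumed inclusive comparison
            V += int(np.sign(lam))
    # (8c) threshold alpha plus uniform noise in [0, 2**M - 1]
    eta = rng.integers(0, 2 ** M)
    if V >= alpha + eta:
        return R, 1
    return V, 0

def leak_neuron_rate(n_ticks=4000, seed=0):
    """Firing rate of neuron l (c=1, lambda=+128, alpha=1, M=0, R=0, no inputs)."""
    rng = np.random.default_rng(seed)
    none = np.zeros(0, dtype=int)           # neuron l receives no axonal input
    V, n_spikes = 0, 0
    for _ in range(n_ticks):
        V, spike = truenorth_tick(V, none, none, np.zeros(4, dtype=int), none,
                                  lam=128, c=1, alpha=1, M=0, R=0, rng=rng)
        n_spikes += spike
    return n_spikes / n_ticks
```

Under these assumptions `leak_neuron_rate()` should come out close to 0.5, so connecting neuron l to an axon of neuron r with weight L delivers the Bernoulli leak B(0.5) * L of the sampling algorithm.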

In the algorithm, the state of the RBM unit is equivalent 
to that of the spiked variable. However, in the TrueNorth 
implementation multiple spikes may be produced by sampling 
neuron r during Ts. A solution for this is to create a so-called 
“refractory effect” using an additional neuron k, which 
essentially counts how many spikes 
are received from neuron r. For this, neuron k is configured 
with threshold α_k = 1 and its membrane potential is set at the 
start of the sampling phase (i.e. at the same moment neuron 
r is set to Vinit) to, for example, -Ts. By incrementing the 
membrane potential by 1 for every spike received from neuron 
r, the membrane potential of neuron k will only be larger than 
-Ts if at least one spike was received. Therefore, after Ts 
has expired, we inject an axonal event of +Ts into neuron k, 
which, due to the unit-valued threshold, will cause it to spike 
if at least one spike was produced by neuron r during the 
sampling phase. 
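
The counting behavior of neuron k can be mimicked in a few lines of Python. This is only an illustrative sketch: the function name `refractory_flag` and the list-based spike train are assumptions made here, and the real mechanism is a configured TrueNorth neuron rather than software.

```python
def refractory_flag(spikes_from_r, Ts):
    """Return 1 if neuron r spiked at least once during the Ts-tick window.

    spikes_from_r: sequence of 0/1 values, one per tick (hypothetical input format).
    """
    if len(spikes_from_r) != Ts:
        raise ValueError("expected one entry per tick of the sampling window")
    v_k = -Ts                      # membrane potential preset at the start of sampling
    for s in spikes_from_r:        # every spike from neuron r increments neuron k by 1
        v_k += 1 if s else 0
    v_k += Ts                      # axonal event of +Ts injected once Ts has expired
    return 1 if v_k >= 1 else 0    # unit threshold: fires iff at least one spike arrived
```

For example, `refractory_flag([0, 1, 1, 0, 1, 0, 0, 0], 8)` collapses three spikes into a single output event, while an all-zero window produces none.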

In sum, the output of a single RBM unit can be analyzed as 
capturing the dynamical behavior of two coupled DTMCs run 
for a time interval of Ts, which are used to set a threshold flag 
(spiked). Two TrueNorth neurons (r and l) are used to form 
the DTMCs, along with a third neuron (k) used for verifying 
whether the threshold flag has been triggered after Ts has expired 
(thus, the “refractory effect”). Together, these three neurons 
comprise an RBM unit. 

As a final note, it was shown that the argument of the 
logistic function (x in Eq. (2)) is modeled as the membrane 
potential of the neuron. Since TrueNorth neural membrane 
potential takes on only signed integer values, and the logistic 
function has a dynamic range between approximately -6 and 
+6, it is necessary to apply a multiplicative scaling factor, s, 
to the RBM weights and biases to increase the dynamic range 
of the neural logistic sampler realization. As a result of this 
scaling, the neural sampler must be realized with appropriate 
values of Ts, Vth, M, and L to enable the RBM to sample with 
high precision from the logistic probability distribution. The 
ideal sampler (with s = 50) is compared with the TrueNorth 
realization (Ts = 8, stochastic threshold ranging from 79 to 
590, and stochastic leak of 49) in Fig. 6. 
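
To make the comparison concrete, the spiking probability of the sampler can be computed exactly as a small Markov chain over the number of leak events. The sketch below rests on several assumptions that are not taken from the TrueNorth specification: the threshold noise is uniform on {0, ..., 2^M - 1} (which reproduces the 79 to 590 range for M = 9), a leak of L is added with probability 0.5 at each tick before the comparison, the neuron does not reset, and saturation is ignored. The name `spike_probability` is hypothetical.

```python
import math

def spike_probability(v_init, Ts, v_th, mask_bits, leak, p_leak=0.5):
    """Probability of at least one spike within Ts ticks (assumed sampler model)."""
    n_thr = 2 ** mask_bits
    # alive[k]: probability that k leak events occurred and no spike has happened yet
    alive = [1.0] + [0.0] * Ts
    for _ in range(Ts):
        nxt = [0.0] * (Ts + 1)
        for k, mass in enumerate(alive):
            if mass == 0.0:
                continue
            for dk, p in ((0, 1.0 - p_leak), (1, p_leak)):
                v = v_init + leak * (k + dk)      # leak is applied first
                # P(v >= v_th + eta) with eta uniform on {0, ..., n_thr - 1}
                p_spike = min(max(v - v_th + 1, 0), n_thr) / n_thr
                nxt[k + dk] += mass * p * (1.0 - p_spike)
        alive = nxt
    return 1.0 - sum(alive)

s = 50  # scaling factor
for v in (-300, -100, 0, 100, 300):
    p_neuron = spike_probability(v, Ts=8, v_th=79, mask_bits=9, leak=49)
    p_ideal = 1.0 / (1.0 + math.exp(-v / s))
    print(f"V={v:5d}  neuron={p_neuron:.3f}  logistic={p_ideal:.3f}")
```

Under these assumptions the curve is monotone in the initial membrane potential and passes close to 0.5 at V = 0, which is the qualitative behavior expected of a logistic sampler.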



Fig. 6: Logistic sampler using TrueNorth neurons. 


C. Sparse connectivity 

The all-to-all connectivity between layers in the RBM 
algorithm implementation has to be adapted to the available 
connectivity in hardware. The 256-input cores in TrueNorth 
present a constraint for the 784-pixel images used in the 
hand-written digits pattern completion application described 
in this paper, where each hidden unit, in a standard RBM 
implementation, is connected to all 784 visible units. A viable 
solution is to use a patching scheme over the original image 
[28], thus reducing the area of the image “observed” by each 
hidden unit. Reciprocally, since the generative RBM presents 
feedback from the hidden layer to the visible layer, the quantity 
of hidden units should also be selected in a way as to reduce 
the number of units “observed” by the visible units. Fig. 7a 
shows how a patch (yellow) is formed by an 8x8 pixel window 
over a binarized MNIST image. 



Fig. 7: Sparsity structure in RBM. (a) Illustration of an 8x8 
pixel patch (in yellow), (b) Sparsity can be seen as applying a 
mask over the network’s weight matrix during offline training. 


In [28], square patches of size p×p were randomly placed 
over the input image, with all the visible units belonging to a 
patch connected to a single hidden unit. Though this resulted 
in reduced network connectivity, a physical implementation 
with fan-in constraints requires a systematic patching scheme 
that guarantees a well-defined maximum number 
of units observed in each layer. Systematic patching is 
particularly important for the generative model due to the 
feedback from hidden to visible units during inference (details 
of generative RBM operation are given in Section V). If, for 
example, patching were performed randomly, a visible unit 
could be captured by more than 256 patches, making 
this fan-in infeasible on TrueNorth. Therefore, we applied an 
overlapping, yet deterministic, patching scheme developed for 
the generative RBM realization. The method uses patches with 
p^2 pixels which are formed by “sliding” a square window over 
the N×N-pixel image and forming a new patch at every new 
position. The total number of overlapping patches produced 
using this method is defined by: 

patches = hidden units = (N - p + 1)^2.    (9) 

A patching scheme implemented in this manner can be 
interpreted as applying a mask, W_mask, over the RBM’s weight 
matrix, where 0’s and 1’s in the mask represent, respectively, 
no connection and presence of connection between visible and 
hidden units. The mask is applied during the offline RBM 
training and the resulting sparse weight matrix is then used 
for mapping the RBM onto TrueNorth. In Fig. 7b, the sum of 
column values in each row of W_mask represents the number of 
hidden units observed by each visible unit, and the sum of row 
values in each column represents the number of visible units 
observed by each hidden unit. With this systematic patching, 
the bounds of the sums (both in rows and columns) are 
well-defined. 
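
The sliding-window mask can be sketched in NumPy as follows. This is a minimal illustration, assuming row-major pixel ordering and a stride of one pixel; the function name `patch_mask` and the ordering of hidden units are choices made here, not details taken from the original training code.

```python
import numpy as np

def patch_mask(N, p):
    """Binary mask of shape (N*N visible units, (N-p+1)**2 hidden units)."""
    n_side = N - p + 1                       # window positions per image dimension
    mask = np.zeros((N * N, n_side * n_side), dtype=np.uint8)
    for r in range(n_side):
        for c in range(n_side):
            window = np.zeros((N, N), dtype=np.uint8)
            window[r:r + p, c:c + p] = 1     # pixels observed by this hidden unit
            mask[:, r * n_side + c] = window.ravel()
    return mask

mask = patch_mask(28, 8)
hidden_per_visible = mask.sum(axis=1)        # row sums: fan-in of each visible unit
visible_per_hidden = mask.sum(axis=0)        # column sums: fan-in of each hidden unit
print(mask.shape, hidden_per_visible.max(), visible_per_hidden.max())
```

With N = 28 and p = 8, both sums are bounded by p^2 = 64, which stays well within the 256-input limit of a core.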

IV. Quality metrics of digital neural sampler and sparse network 

The following subsections present quantitative analyses of 
the impact of the adaptations demanded during the mapping of 
the original RBM algorithm onto TrueNorth. First, the impact 
of quantization due to digital hardware data representation is 
verified. Second, the effect of approximate logistic sampling 
using the digital neural sampler is analyzed. Lastly, due to 
non-viability of all-to-all connections between neurons in 
TrueNorth, we analyze the impact of sparsity in network 
connectivity. For the neuromorphic adaptations, the generative 
performances are verified using the Kullback-Leibler (KL) 
divergence and Annealed Importance Sampling (AIS), which 
are briefly explained next. 

Kullback-Leibler divergence. KL divergence is a measure 
of the difference between probability distributions. The 
probability distribution of an RBM is defined by Eq. (1), with 
the denominator of this equation (i.e. the partition function) 
requiring a normalizing sum over all state probabilities. 
Therefore, since we want to compare 
the performance of the samplers against exact probability 
distributions (computed by Eq. (1)), only small networks with 
tractable partition functions can be analyzed. KL divergence 
is particularly important for our analysis of the digital neural 
sampler and, though we cannot directly extrapolate values of 
this measure to larger networks, the results aid in identifying 
expected performance for each sampler. KL divergence is 
defined by the following equation [29]: 

D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)},    (10) 

where P and Q are two probability distributions, and D_{KL} 
is always non-negative. The state of the system is defined 
by i. For our experiments, P was defined as the distribution 
obtained in the experiment and Q as the true distribution. 
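
A direct transcription of Eq. (10) is given below as a small sketch. It assumes the natural logarithm, that both distributions are supplied as equal-length sequences over the same enumerated states, and that Q(i) > 0 wherever P(i) > 0; the function name `kl_divergence` is illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as sequences."""
    if len(p) != len(q):
        raise ValueError("P and Q must be defined over the same states")
    # States with P(i) = 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

empirical = [0.5, 0.5]     # P: distribution estimated from samples
exact = [0.25, 0.75]       # Q: exact distribution
print(kl_divergence(empirical, exact))
```

The measure is zero only when the sampled distribution matches the exact one, so smaller values indicate a more faithful sampler.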

Annealed Importance Sampling. AIS is a metric used 
to estimate the log-probability of a generative model [30], 
[31], where larger values indicate higher likelihood that the 
model generated the data. For high-dimensional models, such 
as RBMs, where calculation of the partition function is 
intractable, the AIS algorithm is very useful as it performs 
a stochastic estimation of the partition function to compute 
the log probability of the model with respect to the data. 
Therefore, the AIS algorithm will be used for validating the 
generative performance of the sparsely connected network by 
identifying the patch size which produces the largest AIS value. 

A. Quality of data quantization 

The effect of data quantization can be verified by comparing 
the quantized samplers to the ideal. The weights and biases can 
be quantized as follows: multiply their values 
by a scaling factor (s), round the result to the nearest 
integer, and finally divide the rounded result by s. The KL 
divergence of the network with quantized versus exact (high 
precision) weights was computed over 1000 experiments, each 
consisting of randomly generated weights and biases for a 
network with 5 visible and 5 hidden units. For these, based 
on experimental results of weights and biases from previously 
trained RBMs, the values were sampled from the following 
normal distributions: weights ~ N(-0.05, 1.6e-3), visible 
biases ~ N(-0.3, 1), and hidden biases ~ N(0.5, 2.25). The 
KL divergence results of quantized versus non-quantized data 
are shown in Fig. 8, including a box plot of KL divergence for 
s = 15-100. A saturation point can be seen around s = 50. It 
is important to note that very large values of s are beneficial for 
the algorithm; however, they can be costly in terms of hardware 
resources (cores, in the case of TrueNorth), since more neurons 
and longer accumulation times will be required for mapping 
larger values of weights and biases (explained in Section V). 
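
The quantization step described above amounts to one line of NumPy. The sketch below is illustrative only; the function name `quantize` and the sample values are assumptions, and `np.round` rounds exact halves to the nearest even integer, a detail the text does not specify.

```python
import numpy as np

def quantize(values, s):
    """Scale by s, round to the nearest integer, then divide by s."""
    return np.round(np.asarray(values, dtype=float) * s) / s

w = np.array([0.123, -0.3, 0.5, -0.047])
for s in (15, 50, 100):
    err = np.max(np.abs(quantize(w, s) - w))
    print(f"s={s:3d}  max abs error={err:.4f}  (bound 1/(2s)={0.5 / s:.4f})")
```

The rounding error of each parameter is bounded by 1/(2s), which is why increasing s improves fidelity at the cost of larger integer weights to map.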



Fig. 8: Generative performance versus quantization. 


B. Quality of the digital neural sampler 

The scaling factor impacts the resource usage of the system 
and also the latency, by increasing the accumulation 
time (see Section VI). The other parameter which 
affects latency is the sampling time window (Ts). Thus, when 
neural samplers present the same generative performance, the 
selected configuration will naturally be the one presenting the 
lowest Ts. Additionally, when selecting the neuron parameters 
(the “sampler configuration”), one important aspect of 
TrueNorth neurons that should be taken into account is the 
membrane potential range. From Eq. (8c), we can observe that 
the upper bound (i.e. positive saturation) value of membrane 
potential is defined by the sum α_j + η(M_j). Therefore, for our 
analysis, we have chosen configurations with this sum close 
to or surpassing the upper bound of the dynamic range of the 
sigmoid function (≈ 6 × scaling factor) while still presenting 
adequate sigmoid fitting. 

To begin the neuron parameter selection, we first fix the 
scaling factor, then the quality of the digital neural sampler 
can be verified by sweeping over values of sampling time 
window (Ts) and neuron parameters (Vth, M, and L) to, 
ideally, overlap with the sigmoid. The best fit was found for 
each Ts value by performing a parameter search to reduce the 
mean squared error (MSE) between the ideal (scaled) logistic 
function and the curve produced by the neural sampler. For 
our experiments, five configurations were chosen, with the 
TrueNorth neuron parameters of the configurations (G1-G5) 
shown in Table I. The MSE of each versus the ideal logistic 
function is presented in the rightmost column. 


Config.   Scaling factor   Ts   Vth   M   L     MSE 
G1        50               1    0     7   125   0.4878 
G2        50               2    0     8   100   0.1311 
G3        50               4    66    8   77    0.0741 
G4        50               8    79    9   49    0.0412 
G5        50               16   186   9   36    0.0415 


TABLE I: Neuron configurations for neural sampler analysis. 


The generative model performance for these configurations 
was determined by means of average KL divergence of the 
model, and also the ideal logistic sampler (using Eq. (3)), 
versus the true distribution (computed by Eq. (1)), over 10 
randomly sampled networks (5 visible and 5 hidden units), 
with 15 experiments run for each network, and each 
experiment consisting of 10^ samples. Fig. 9 shows the average KL 
divergence results of the different parameter configurations. 
The smaller plot in this figure is a boxplot of the 150 (10 
networks x 15 experiments per network) KL divergence 
values at sample 10^. Naturally, the configurations with lower 
MSE also presented lower KL divergence, with G3 and G4 
practically overlapping. 



As was mentioned at the end of Section II, the DTMC 
computations of the neural sampler can be very useful when 
simulating the network dynamics. Instead of having to 
simulate every step of the neuron during the sampling time window 
(Ts), we can simply use the spiking probability curve obtained 
from the DTMC as the neuron’s transition operator. In other 
words, the probability of spiking after Ts can be extracted 
from the curve and this value is then compared to a 
uniformly-sampled number between 0 and 1. Though this shortcut does 
not alter the operation of the neural sampler algorithm 
(and cannot be used in practice), it speeds up simulations 
considerably. 

A comparison of the normalized (i.e. all values divided by 
the worst case, model G1) MSE and KL divergence (at 
sample 10^) is shown in Fig. 10. Though the results for both 
measures were not identical (for example, the MSE values for G4 
and G5 were basically identical, yet the KL divergence for 
G5 showed a slight improvement), the figure clearly shows 
similar trends for both measures. Thus, these results indicate 
that using the DTMC analysis of the sampler combined with 
the MSE measure can be a powerful tool for quickly 
estimating the generative performance of a sampler. For the 
generative RBM implementation on TrueNorth, configuration 
G5 was chosen due to slightly better KL divergence results. 



Fig. 10: MSE and KL divergence of neural Gibbs samplers (x-axis: neuron parameter configuration). 

C. Quality of the sparse network 

Since sparsity is difficult to evaluate in small networks, the 
generative qualities of the sparse RBM were verified by means 
of the AIS measure of a network with 784 visible and 500 
hidden units, pre-trained using the MNIST dataset. In a related 
vein, reference [28] shows how a sparsely connected RBM can 
produce a more noise-tolerant model for classification. For our 
application, sparsity is actually necessary for reducing the 
fan-in of each neuron, which, in TrueNorth, is limited by the 
256-input cores. The patching scheme proposed for the generative 
RBM, described in Subsection III-C, takes into account the 
feedback from hidden to visible units. With this method, as 
illustrated in Fig. 7b, the patch dimension (p) defines the 
maximum number of connected units in both directions (i.e., 
visible → hidden and hidden → visible). 

AIS measure versus patch dimension results are shown 
in Fig. 11. For low p values, lower log-probabilities were 
produced on account of less information captured by each 
patch. For large p values, the log-probability is also lower 
on account of the smaller number of hidden units (refer to Eq. (9)) 
in the network. Given the performance results of the model, 
an optimal patch size of 8x8 was chosen for the generative RBM 
implementation, resulting in (N - p + 1)^2 = (28 - 8 + 1)^2 = 441 
hidden units. 



Fig. 11: Generative performance versus sparsity. 

Fig. 12: TrueNorth RBM. (a) The 3-stage architecture used to distribute spike events (splitter), produce the desired membrane 
potential, and realize the sigmoid sampling. (b) The generative model structure, formed by combining two 3-stage blocks 
and including the feedback between layers. (c) Example of a pattern completion task where the digit “6” is incrementally 
reconstructed. 


V. Generative RBM architecture on TrueNorth 

The generative RBM was mapped on TrueNorth by 
developing a modular 3-stage architecture, where each combination 
of these three stages represents the transition between RBM 
layers. The diversity of configurable parameters present in 
TrueNorth is critical to the realization, with particular neuron 
types, connectivity strategies, and reset modes in each stage. 
The physical constraints of TrueNorth - particularly 256 axons 
and neurons per core, only 1 destination axon per neuron, and 
4 distinct weights per neuron - defined the design flow of the 
RBM. The architecture, composed of stages (1) 
refractory-and-splitter, (2) quantization and (3) accumulate-and-sample, 
is illustrated in Fig. 12a. 

The generative application implemented was a pattern 
completion task of a corrupted MNIST image. The signal flow 
in the TrueNorth RBM is illustrated in Fig. 12b, where each 
row is a 3-stage module and the blue and red blocks represent 
information related to visible and hidden units, respectively. 
Note that the second stage in each module contains both colors, 
since this is the transition between visible and hidden layers, 
i.e. where the arguments of p(v|h) and p(h|v) are computed. 
Finally, the data flow for the application is represented in 
Fig. 12c. For this task, part of an image of the digit “6” (not 
used during training) was removed, and the figure shows the 
first 3 reconstructions based on the partial data. 

A. Stage 1a: Splitter 

Stage 1 serves a dual role in the system: (1) a splitter for 
input signals in the respective RBM layer and (2) a refractory 
effect of the neuron. Since each RBM unit in the visible/hidden 
layer is connected to multiple units in the hidden/visible layer, 
along with the fact that TrueNorth neurons present only one-to-one 
connections (i.e. each neuron can only target a single axon 
on the entire chip), a signal splitter is necessary to create the 
RBM’s one-to-many connections. Therefore, stage 1 generates 
the required number of replicas of an RBM unit to be used 
in the quantization stage. Fig. 13a illustrates a splitter core 
used for generating the necessary number of replicas of each 
of the visible units. The neurons are set to unit thresholds and 
all synaptic connections have weight +1, which 
causes the neurons to spike whenever an axon event arrives. 
The refractory effect function of this stage and the two control 
signals (C+ and C-) are discussed later in Subsection V-D. 

B. Stage 2: Quantization 

In TrueNorth, the weight of connections between axons and 
neurons can be configured with two constraints: the weights 
between axons connected to a given neuron are allowed to 
have only 4 different values; and each axon is configured as 
one of 4 types, which determines which of the 4 weights will be 
used for the connection between the axon and the respective 
neuron [27]. The first constraint limits the number of different 
possible weights, while the second limits the “reutilization” 
of axons between neurons. This is because an axon can be 
used amongst two neurons only if the weight stored in each 
neuron’s memory position - defined by the axon type - is 
the desired synaptic weight for each of these connections. 
Several methods proposing the usage of low-precision weights 
and biases in artificial neural networks have been developed 
[32], [33]; however, these methods target only discriminative 
models. As was observed in Fig. 8, a large scaling factor (i.e. 
high precision) is critical for obtaining satisfactory generative 
performance in RBMs. Since the precision and diversity of 
weights and biases demanded by the generative RBM cannot 
be directly represented by the TrueNorth memory structure, 
a quantization stage is therefore necessary to realize the 
connectivity between RBM units. 

The representation of individual RBM weights and biases 
was achieved by using a collection of neurons in stage 2, each 
with its own weight, which together can produce the desired 
membrane potential (i.e. the equivalent argument of σ(x)). For 
this, linear-reset, unit-threshold neurons are used [27], and they 
operate by decrementing their membrane potential by 1 every 
time they spike, continuing to do so while the value is above 
zero. In this manner, the collective activity of many stage 2 
neurons encodes the RBM weight/bias, while stage 3 will be 
used to accumulate the spikes from these many neurons into 
a single neuron. 

Fig. 13: Example of TrueNorth RBM stages: (a) Refractory-and-Splitter, (b) Quantization, and (c) Accumulate-and-Sample. 

The quantization of weights and biases is done by selecting 
a maximum accumulation time (Ta), which will be the largest 
value of membrane potential a stage 2 neuron can reach. 
In other words, every input spike into stage 2 axons will 
charge the membrane potential of each quantization neuron 
up to at most Ta, after which they will freely operate, with 
spiking activity guaranteed to cease in a maximum of Ta ticks. 
Fig. 13b exemplifies a stage 2 core with Ta=8 and visible units 
v1, v2, and v3 connected to hidden unit h1 with weights +7, 
-12, and -2, respectively. Since Ta is a user-defined value, 
intuitively we would select the lowest value possible so as to 
reduce the overall latency of the system. However, depending 
on the number of weights to be mapped and their specific 
values, attempting to use smaller values of Ta will exceed the 
number of available neurons in a core. Note that negative 
weights are mapped as positive values, since stage 2 only takes 
into account the intensity (absolute value) of the connection 
between units, independent of being excitatory or inhibitory. 
The actual sign of the connection is taken care of in stage 3. 
Lastly, since bias values are independent of neuronal activity, 
these are realized by sending an external spike event to the 
bias axon (b_h1 = 10 in Fig. 13b) each time the sum of inputs 
to a given RBM unit neuron is to be computed. 
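
The stage 2 encoding can be illustrated with a short Python sketch. It assumes that a weight whose magnitude exceeds Ta is spread over several neurons, each charged to at most Ta (as suggested by the weight of -12 with Ta=8 in Fig. 13b), and that a linear-reset neuron emits one spike per tick while its potential is at least 1. The function names are hypothetical.

```python
def split_weight(weight, Ta):
    """Charges of the stage 2 neurons that together encode |weight|."""
    magnitude = abs(int(weight))          # the sign is handled later, in stage 3
    full, rest = divmod(magnitude, Ta)
    return [Ta] * full + ([rest] if rest else [])

def linear_reset_spikes(charge, Ta):
    """Spike train (one entry per tick) of a unit-threshold, linear-reset neuron."""
    v = charge
    train = []
    for _ in range(Ta):
        if v >= 1:                        # threshold of 1 reached: spike ...
            train.append(1)
            v -= 1                        # ... and decrement by 1 (linear reset)
        else:
            train.append(0)
    return train

for w in (+7, -12, -2):
    charges = split_weight(w, Ta=8)
    total = sum(sum(linear_reset_spikes(c, 8)) for c in charges)
    print(w, charges, total)
```

In this reading, the total number of spikes delivered to stage 3 within Ta ticks equals the weight magnitude, and all activity ceases by the end of the accumulation window.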

C. Stage 3: Accumulate-and-Sample 

Stage 3 is used to accumulate the activity of the quantization 
neurons into a single neuron, which will then be sampled 
(as described in Subsection III-B). Prior to accumulation, the 
membrane potentials of the stage 3 neurons are initialized 
to zero. Then, during the first time window (Ta), stage 3 
neurons accumulate spikes from stage 2 neurons to form a 
membrane potential equivalent to the argument of the logistic 
function. The neurons used in stage 3 have a non-resetting 
property to prevent clearing the membrane potential during the 
accumulation phase. This is followed by the time window Ts, 
during which the stochastic threshold and leak properties of 
the neuron are used for sampling from the logistic probability 
distribution. During this second time window, if the neuron’s 
membrane potential surpasses the threshold, the neuron may 
spike multiple times since it is configured as non-resetting. 


For the spikes to correctly represent a sample from the 
logistic function, the refractory stage is necessary to register 
a maximum of 1 spike event per sampling window, and is 
described in the next subsection. 

An example of a stage 3 core crossbar configuration is 
shown in Fig. 13c, with the sign of the RBM weight/bias now 
included in the synaptic weights. Note the use of recurrent 
connections from additional neurons to realize the stochastic 
leak. These additional neurons are necessary because an 
internally generated stochastic leak (for example, in neurons h1, 
h2, and h3) can only assume an absolute value of 1. Since our 
digital neural sampler implementation usually demands larger 
values of L, the “leak” neurons were created with threshold 
of 1 and internal stochastic leak sampled from a Bernoulli 
distribution with p ≈ 0.5 (refer to Subsection III-B). In this 
manner, there is approximately 50% chance of these “leak” 
neurons spiking at each tick, thus generating a spike event to 
their respectively associated neuron (h1, h2, etc.), which can 
be connected with a user-defined synaptic weight of L > 1. 

D. Stage 1b: Refractory effect 

As was discussed in Subsection III-B, for the multiple 
spikes from the accumulate-and-sample stage to be converted 
to a single spike event (which represents a sample from 
the logistic probability distribution), stage 1 neurons were 
configured to produce a “refractory effect”. What essentially 
occurs in stage 1 is a delayed propagation of the spiked 
variable (in the digital neural sampler algorithm in Subsection 
II-A), whose value is dependent on spikes from the previous 
RBM layer’s stage 3. This delayed response after Ts has 
expired, therefore, results in a “frame alignment” (in the same 
1 ms time step) of RBM unit samples to subsequent layers and 
guarantees precise operation of the generative RBM algorithm. 

The refractory effect is obtained in TrueNorth by 
configuring stage 1 splitter neurons with a negative saturating 
membrane potential. The membrane potentials of stage 1 
neurons are initialized to the negative saturating value C-, with 
|C-| > Ts, at the start of the sampling phase of stage 3 
in the other RBM layer. Every incoming spike in stage 1 
will cause the membrane potential of its associated neuron to 
increase by 1. After Ts, the membrane potentials of the stage 1 
neurons are incremented by C+ (= |C-|), causing the neurons 
which received at least one spike to cross the threshold and 
simultaneously generate (“frame alignment”) a spike to the 
subsequent stage 2. 

VI. Spike processing flow in TrueNorth RBM 

An example of the spiking activity flow between RBM 
layers is shown in Fig. 14, where the following parameters 
were used: Ts=10, Ta=8, C-=-30, stage 3 stochastic threshold 
ranging between 10 and 17, and stochastic leak of +3. In the 
example, the x-axis denotes time (in 1 ms ticks) and the y-axis 
denotes the value of the membrane potential (Vmem). The blue 
line is the neuron’s membrane potential, the solid red line is 
the saturation level, the dashed red line is the threshold, and 
the red circles represent spike events. 


Fig. 14: Example of spike processing flow (panels: Stage 1, Stage 2, Stage 3). Stage 1 realizes two 
functions: refractory effect in (a) and (b), and splitter (spike 
distribution) in (c). Stage 2 quantizes the weights between 
RBM layers in (c) and (d), producing the desired membrane 
potential for stage 3 to sample from. Stage 3 accumulates the 
spikes from stage 2 linear-reset neurons between (e) and (f), 
and the sampling procedure is performed between (f) and (g). 

The sequence of events (the letters) presented in Fig. 14 is 
detailed below: 

(a) Time=2. Stage 1 neurons are initialized to C-, after 
which they begin accumulating spikes from stage 3 neurons 
of the other RBM layer. 

(b) Time=10. After Ts, the C+ 
signal is applied, and every neuron which captured at least 
one spike from stage 3 neurons crosses the threshold (= 1). 

(c) Time=11. As C+ is applied, spike events from stage 1 
neurons are transmitted to stage 2 axons. In this example, 
the stage 2 neuron is charged to a membrane potential of 6. 

(d) Time=12-17. The linear-reset stage 2 neurons continuously 
produce spike events to stage 3 axons until their membrane 
potentials return to zero. 

(e) Time=13. At this moment, the 
stage 3 neurons begin accumulation for Ta ticks. 

(f) Time=21. 
After the stage 3 neurons have accumulated their membrane 
potentials to the desired values, the sampling phase begins. 
The stage 1 neurons of the other RBM layer are initialized 
to C-; the stochastic threshold and leak come into effect at 
stage 3. 

(g) Time=31. After Ts ticks, the stage 3 neurons are 
reinitialized. 

The example shows the complete sampling procedure of 
an RBM layer in the TrueNorth implementation. In stage 2, 
weights and biases are converted from membrane potential 
values to spikes, which are accumulated in stage 3 until 
the appropriate membrane potential is formed (i.e. has been 
grouped into a single neuron) at the start of the sampling 
phase. The two TrueNorth neurons in stage 3 comprise the 
coupled DTMCs used in the neural sampler. The stage 1 neurons 
produce the delayed spike response (“refractory effect”) 
in the subsequent RBM layer, which constitutes a sample from 
an RBM unit. Therefore, in this example, a new sample is 
produced in an RBM layer every (Ta + Ts + 2) = 20 ticks; 
the 2 additional ticks are necessary for control signals. The entire 
process of producing a new sample of the visible units - the 
output of the generative RBM - would then take 2 x 20 = 40 
ticks (= 0.04 seconds). 
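
The latency arithmetic can be written as a small helper. This is a sketch under the assumptions stated in the text (2 control ticks per layer, two layer transitions per visible sample, 1 ms per tick); the function name `sample_latency` is illustrative.

```python
def sample_latency(Ta, Ts, control_ticks=2, layers=2, tick_seconds=1e-3):
    """Return (ticks per layer, ticks per visible sample, seconds per visible sample)."""
    ticks_per_layer = Ta + Ts + control_ticks   # accumulate + sample + control signals
    total_ticks = layers * ticks_per_layer      # visible -> hidden -> visible
    return ticks_per_layer, total_ticks, total_ticks * tick_seconds

print(sample_latency(Ta=8, Ts=10))    # parameters of the Fig. 14 example
print(sample_latency(Ta=32, Ts=16))   # parameters reported in Section VIII
```

The same expression reproduces both the 40-tick figure of this example and the 100 ticks per image quoted for the final implementation.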

VII. Design automation 

TrueNorth system configuration can be realized using the 
object-oriented Corelet Language, which is an abstraction for 
representing the network of neurosynaptic cores [34]. The 
developed design automation procedure consists of creating 
systematic data structures, originating from the RBM weight 
and mask matrices, RBM biases, and user-defined parameters, 
which include: accumulation time (Ta); sampling time (Ts); 
data scaling factor (s); and sampler stochastic threshold 
and leak. Once these have been defined, the automation 
procedure produces an optimal configuration of cores which 
minimizes the number of axons and neurons used for the 
RBM realization. Three optimization strategies were created, 
where strategies 1.1 and 1.2 are mutually exclusive, yet 
they can be combined with strategies 2 and 3. Note that all 
considerations for hidden units are also valid for visible units. 

Strategy 1.1: The first strategy involves establishing the 
number of neurons required for mapping each RBM weight 
and bias. Without optimization in stage 2, the number of 
neurons n_j used when quantizing the weights between the 
visible units observed by hidden unit h_j can be computed by 
n_j = \sum_i \lceil |w_{ij}| / T_a \rceil. This direct method of mapping weights 

and biases does not take into account the fact that many 
stage 2 neurons may represent low weights, which will cause 
them to complete spiking (during the accumulation phase) 
before neurons which represent higher values, such as weight 
Ta. Since the network must always go through Ta ticks during 
the accumulation phase, it would be more efficient to 
connect a given neuron to as many axons as possible, provided 
the total synaptic weight is guaranteed not to exceed Ta. In 
the limiting case, neurons which map weights -1 and +1 can 
have up to Ta axons connected to them. 

Though this first strategy benefits the core utilization 
considerably, better optimizations are possible. This is 
because the order in which the RBM weights are chosen to 
be mapped in stage 2 is defined by the user, yet different 
mapping sequences may utilize fewer cores. For example, 
suppose Ta=^ and the weights to be mapped are 1 through 
6 for visible units v1 through v6, respectively. If we were 
to map them in this order, a total of 6 neurons would be 
used (Fig. 15a). On the other hand, if we were to map in the 
reverse order (6 through 1), a total of 7 neurons would be 
necessary (Fig. 15b). Therefore, the order of weight mapping 
affects the core utilization. Since the possible number of 
weight orderings to be analyzed is intractable, better results 
can be obtained by using strategy 1.2. 

Fig. 15: Strategy 1.1 examples of stage 2 quantization. 

Strategy 1.2: In this strategy, the weights closest to a 
user-defined central weight value are mapped first. By sweeping 
through all possible central weights, an optimal value can be 
empirically obtained. Fig. 16 shows the number of neurons 
used when mapping the weights -20 through 20 with Ta = 4. 
The red line is the number of neurons (120) obtained with no 
optimization, while the black line is the number (110) when 
using sequential mapping with weight neuron “reutilization” 
(i.e. strategy 1.1). The blue line shows the results for the 
central weight method (strategy 1.2). The reduction from 
110 to 107 neurons when using a central weight of 5, for 
example, is small (approximately 3%), though more significant 
reductions are possible when this strategy is combined with 
strategies 2 and 3. 
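
The unoptimized neuron count can be checked with a few lines of Python. The sketch assumes that, without optimization, a weight of magnitude |w| occupies ceil(|w| / Ta) stage 2 neurons and that the accumulation time in this comparison is Ta = 4; both are assumptions chosen because they reproduce the 120-neuron baseline. The lower bound shown is a simple capacity argument, not a result from the paper, and the function names are hypothetical.

```python
import math

def unoptimized_neurons(weights, Ta):
    """One group of ceil(|w| / Ta) neurons per weight, with no sharing."""
    return sum(math.ceil(abs(w) / Ta) for w in weights)

def lower_bound_neurons(weights, Ta):
    """No packing can beat total magnitude divided by the per-neuron capacity Ta."""
    return math.ceil(sum(abs(w) for w in weights) / Ta)

weights = range(-20, 21)
print(unoptimized_neurons(weights, Ta=4))   # baseline without optimization
print(lower_bound_neurons(weights, Ta=4))   # capacity bound on any packing
```

Under these assumptions the reported counts of 110 (strategy 1.1) and 107 (strategy 1.2) both fall between the baseline and the capacity bound.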



Fig. 16: Optimization strategy comparison in terms of number 
of neurons used to map the desired weights. 


Strategy 2: One final optimization can be performed 
in stage 2. When a hidden unit is mapped, the number 
of remaining neurons in the core may be enough to map 
additional hidden units. If this is the case, among the hidden 
units to be mapped, we select the one which has the largest 
number of visible units in common with the hidden units 
previously mapped in the given core. This is because the 
patching scheme produces hidden units which may have some 
visible units in common and, thus, specific units are capable 
of sharing axons. This strategy is naturally also valid when 
mapping visible units. 

Strategy 3: Stages 1 and 3 can also be optimized via a 
greedy optimization method. For stage 1, the algorithm selects 
the unit which uses the largest number of neurons (replicas for 
axons in stage 2), yet does not exceed the neuron limit in 
the core. If no remaining unit fits in the current core, the 
algorithm creates a new core, and this repeats until all units 
have been mapped. For stage 3, the algorithm does the same 
as for stage 1, now selecting the unit which uses the largest 
number of axons (quantization neurons from stage 2) without 
exceeding the axon limit in the core. 
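
The greedy rule of strategy 3 can be sketched as a simple packing loop. This is an illustrative reading of the description above, not the authors' implementation; `unit_costs` and `core_limit` are hypothetical names, and the cost stands for neurons in stage 1 or axons in stage 3.

```python
from typing import Dict, List


def greedy_pack(unit_costs: Dict[str, int], core_limit: int) -> List[List[str]]:
    """Pack units into cores, always taking the largest unit that still fits."""
    if any(cost > core_limit for cost in unit_costs.values()):
        raise ValueError("a unit exceeds the per-core limit")

    remaining = dict(unit_costs)
    cores: List[List[str]] = []
    current: List[str] = []
    free = core_limit

    while remaining:
        fitting = [u for u, cost in remaining.items() if cost <= free]
        if fitting:
            # Largest unit that does not exceed the free resources.
            unit = max(fitting, key=lambda u: remaining[u])
            free -= remaining.pop(unit)
            current.append(unit)
        else:
            # Nothing fits any more: close this core and open a new one.
            cores.append(current)
            current, free = [], core_limit

    if current:
        cores.append(current)
    return cores
```

For example, with a limit of 256 and costs of 150, 100, 100, 56 and 60, the loop fills the first core with the 150 and 100 units and places the remaining three units in a second core.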

VIII. Results 

For realizing the generative model - the MNIST pattern 
completion task detailed in Section V - on TrueNorth, an 
RBM with 784 visible units and 441 hidden units (generated 
by using 8x8 patches) was trained offline using the persistent 
Contrastive Divergence algorithm [35]. The generative 
application demands a sampler with high fidelity with respect 
to the ideal sampler. To achieve this, the parameters were 
selected according to the criteria outlined in Section IV: 
scaling factor = 50, Ts = 16, stochastic leak = 36, and stochastic 
threshold ranging from 186 to 697. The choice of the scaling 
factor directly impacts the RBM weight and bias magnitudes. 
To map these weights in stage 2, a trade-off is necessary 
between the accumulation time and the quantity of neurons and 
cores demanded by the application. Therefore, a compromise 
value of Ta = 32 was selected for the mapping. With these 
parameters, a new RBM image is sampled every 100 ticks 
(= 0.1 seconds). 

A. Resource utilization and power estimate 

Using the automation strategies outlined in Section VII, the 
generative RBM was realized with 865 cores, representing 
21% of the total number of cores on TrueNorth. Table II 
shows how applying the optimization strategies 1.2, 2 and 3 
drastically reduced the core utilization. 


Case   Strategies   # of cores   Chip utilization 
1      none         2956         72.2% 
2      1.1, 2, 3    906          22.1% 
3      1.2, 2, 3    865          21.1% 


TABLE II: Core utilization results based on optimizations. 

In Figure 13, it was shown how each RBM unit is actually 
formed by 3 TrueNorth neurons (stochastic leak, stochastic 
threshold, and “refractory effect”). However, the final 
implementation of the 784 + 441 = 1225 RBM units consisted of 
865 cores, with a total of 135k mapped TrueNorth neurons. 
This number is mainly due to the splitters and to the stages 
needed for weight and bias quantization, representing 82% of 
the total neurons used. Virtually all of the remaining neurons 
were used for the control signals for the system operation, 
while the digital neural samplers - which represent the RBM 
units per se - used up only 0.5% of the mapped TrueNorth 
neurons. In practice, this results in a ratio of 110 TrueNorth 
neurons required to implement each RBM unit, and shows how 
generative models implemented on high dimensional datasets 
incur a considerable overhead due to the aforementioned 
hardware constraints. Nonetheless, given the network size (784 
+ 441 RBM units), image patch size (p = 8), and accumulation 
(Ta = 32) and sampling (Ts = 16) times, we conservatively 
estimate a power consumption of 5 mW for the optimized 
TrueNorth generative RBM (case 3 in Table II). This results 
in an estimated 0.5 mJ of energy consumed to generate each 
MNIST image sample. 
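
The figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text, with one exception: the total of 4096 cores per TrueNorth chip is not given in this section and is assumed here, although it is consistent with the percentages in Table II.

```python
# Values quoted in the text.
TICKS_PER_SAMPLE = 100
TICK_SECONDS = 1e-3          # implied by "100 ticks (= 0.1 seconds)"
POWER_WATTS = 5e-3           # conservative power estimate for case 3
MAPPED_NEURONS = 135_000
RBM_UNITS = 784 + 441
CORES_USED = {"none": 2956, "1.1, 2, 3": 906, "1.2, 2, 3": 865}

# Assumption: total core count of one TrueNorth chip (not stated above).
TOTAL_CORES = 4096

sample_period_s = TICKS_PER_SAMPLE * TICK_SECONDS
energy_per_sample_mj = POWER_WATTS * sample_period_s * 1e3
neurons_per_unit = MAPPED_NEURONS / RBM_UNITS
utilization_pct = {
    strategies: round(100 * cores / TOTAL_CORES, 1)
    for strategies, cores in CORES_USED.items()
}

print(f"sample period:     {sample_period_s:.1f} s")
print(f"energy per sample: {energy_per_sample_mj:.1f} mJ")
print(f"neurons per unit:  {neurons_per_unit:.0f}")
print(f"chip utilization:  {utilization_pct}")
```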

B. Pattern completion outputs and metrics 

Example outputs of the pattern completion task are shown 
next. In Fig. 17a, one example output for each of the ten 
digits is presented: the first column is the original data (“O”), 
the middle column is the corrupted (“C”) image sent into the 
TrueNorth RBM, and the third column is the reconstructed 
(“R”) output after 50 RBM samples (= 5 seconds). These 
images were chosen to represent positive results, while Fig. 17b 
shows images whose reconstruction was not ideal. Lastly, Fig. 
17c illustrates a sequence of reconstructions for a corrupted 
image of the digit “6”; the sample number is indicated above 
each image. A decent reconstruction sample could be obtained 
after about 4 seconds; however, with an earlier sample we 
could possibly confuse the “6” with a “5”. 

Fig. 17: TrueNorth pattern completion task outputs. Positive 
(a) and negative (b) reconstruction results. Reconstruction of 
the digit “6” in (c), with the sample number indicated above 
each image. 

Depending on the percentage of image occlusion (“corruption”), 
the RBM may or may not be able to reconstruct a 
satisfactory representation of the original image. Therefore, 
we performed experiments with different image occlusion 
percentages and measured the Hamming distance (HD) - identical 
to the number of incorrectly reconstructed pixels in this case - 
at the 50th RBM reconstruction sample for 1,000 test images. 
The mean value of the HDs was normalized according to 
the number of non-occluded pixels in the image. The results, 
illustrated in Fig. 18, show that the reconstructive performance 
of the neural sampler nearly matches that of the ideal sigmoid 
sampler. 
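
One possible reading of this metric is sketched below, assuming binary images stored as flat lists of 0/1 pixels and a fixed number of non-occluded pixels per image for a given occlusion percentage. The function names and the exact normalization are assumptions based on the description above, not the authors' evaluation code.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length binary images differ."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(x != y for x, y in zip(a, b))


def normalized_mean_hd(
    originals: List[Sequence[int]],
    reconstructions: List[Sequence[int]],
    n_non_occluded: int,
) -> float:
    """Mean HD over the test set, divided by the non-occluded pixel count."""
    hds = [hamming_distance(o, r) for o, r in zip(originals, reconstructions)]
    return (sum(hds) / len(hds)) / n_non_occluded
```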

Fig. 18: Sampler generative performance analysis in terms of 
incorrectly generated pixels (Hamming distance). 

Lastly, the mean HD for the TrueNorth RBM can be 
tracked throughout the reconstruction process. In Fig. 19, 
convergence to the mean HD value for 35% image occlusion 
(dashed black line) occurs after about 10 RBM reconstruction 
samples (about 1 second at 0.1 seconds per sample). This 
result is important for defining the practical time 
expenditure demanded for the generative task of MNIST image 
reconstruction. 





Fig. 19: Normalized Hamming distance during reconstruction 
of 35% occluded image on the TrueNorth RBM. 


IX. Conclusions and future work 

In this work, we have shown the first generative RBM 
implementation on neuromorphic hardware. For this, we followed 
a step-by-step procedure for producing the Gibbs sampling 
kernel - the sigmoidal spiking probability - using digital 
spiking neurons and for mapping the generative RBM 
algorithm onto a digital neuromorphic VLSI substrate. The neural 
sampler is an elegant solution as it uses bio-inspired dynamics 
to simultaneously incorporate the logistic function look-up and 
the comparison with a randomly generated number, which 
together represent a Gibbs sample. A discrete-time Markov 
chain (DTMC) analysis of the neural sampler was performed, 
resulting in a simplified method of obtaining the spiking 
probability without the need for long neuron behavior simulations. 
The generative performance of the neuromorphic adaptations 
was then verified using the Kullback-Leibler (KL) divergence 
and the Annealed Importance Sampling (AIS) algorithm. We 
also showed how mean squared error (MSE), along with the 
DTMC, can be used as an efficient method for obtaining 
insight into the sampler quality. 
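
For reference, the two conventional steps that the neural sampler merges (evaluating the logistic function and comparing against a random number) can be written in software as below. This is the textbook Gibbs update for a single binary RBM unit, not the spiking implementation described in this work; the function names are illustrative.

```python
import math
import random
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gibbs_sample_unit(
    weights: Sequence[float],
    states: Sequence[int],
    bias: float,
    rng: random.Random,
) -> int:
    """Sample one binary unit given the states of the opposite layer."""
    activation = bias + sum(w * s for w, s in zip(weights, states))
    p_on = sigmoid(activation)             # step 1: logistic function
    return 1 if rng.random() < p_on else 0  # step 2: compare with random number
```
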
In the TrueNorth system, we followed a systematic 
development and implementation process of a modular architecture, 
which can be used for realizing generative RBMs and DBNs 
on a substrate of digital neurosynaptic cores. The 3-stage 
architecture and the design automation procedure provide a 
path towards automated neural network applications on brain- 
inspired processors for more complex inference tasks, such 
as natural image recognition and time series generation. The 
modular characteristic of the architecture naturally lends itself 
to implementations of deeper networks (DBNs). Also, the 
architecture of stages 1 and 2 with the associated design 
automation procedure can even be used to realize other neural 
networks which are defined by sparse weight matrices. We are 
currently working on new algorithms which incorporate more 
of the hardware constraints during the training phase. 

The developed architecture takes advantage of many of the 
features present in neuromorphic systems. The spike flow of the 
3-stage architecture developed for the TrueNorth RBM uses spikes 
for communication between cores, propagating information 
between RBM layers. The number of computations is also 
reduced in the neuromorphic scenario as only non-zero 
multiplications are performed (i.e. only when a spike occurs does 
data processing take place), which is contrary to what occurs 
traditionally for matrix multiplications in CPUs. Additionally, 
the sampler makes use of stochastic neural properties to 
produce an approximate sigmoidal firing probability, necessary for 
the RBM sampling procedure. Despite these positive features, 
information processing in the network is somewhat sequential 
(i.e. essentially two stages are in use at each instant), 
which is mainly a result of the limited weight values per 
neuron in the present hardware. Inspired by the sampling 
methods proposed in [36], [37], we are currently developing 
paths towards algorithms on TrueNorth which incorporate the 
hardware constraints yet present a more continuous flow of 
spike processing for inference. 
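
The difference between dense and event-driven accumulation mentioned above can be illustrated in software. This is only a conceptual sketch with hypothetical function names; it does not model the TrueNorth cores or the 3-stage architecture.

```python
from typing import List, Sequence


def dense_input(weights: Sequence[Sequence[int]], spikes: Sequence[int]) -> List[int]:
    """Conventional matrix-vector product: every weight is multiplied."""
    return [sum(w * s for w, s in zip(row, spikes)) for row in weights]


def event_driven_input(
    weights: Sequence[Sequence[int]],
    spike_indices: Sequence[int],
    n_out: int,
) -> List[int]:
    """Accumulate weights only for inputs that actually spiked."""
    totals = [0] * n_out
    for j in spike_indices:      # one event per spiking input
        for i in range(n_out):
            totals[i] += weights[i][j]
    return totals
```

With a binary spike vector, both functions give the same result, but the event-driven version touches only the weight columns of inputs that spiked, so its work scales with the number of spikes rather than with the input size.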

As a final note, research proposing RBMs and DBNs 
as solutions to applications of BCIs and EEG classification 
generally focuses on discriminative models [5], [6], [38]. 
However, BCIs could naturally benefit from generative models, 
targeting applications such as time series EEG or neural signal 
reconstruction for artificial limb control. An attractive feature 
of spike-based neuromorphic processors for spike-based neural 
interfaces would be the direct match between the event-driven 
data formats of the artificial and biological neuronal networks 
at the interface, potentially obviating the need for extra signal 
processing to convert between spiking and mean-rate 
representations, and possibly allowing one to exploit the inherent 
temporal code of neuronal spike recordings or pulsed stimulation 
for further improvements in BCI performance. 